{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python字符串和文本"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "几乎所有的程序不管是解析数据还是产生输出都会涉及到文本处理。这一章将重点关注文本的操作处理，比如提取字符串，搜索，替换以及解析等。大部分的问题都能调用字符串的内建方法完成。但是，一些更为复杂的操作可能需要正则表达式或者强大的解析器，这些我们都会进行详细讲解。并且也会提及在操作Unicode时候碰到的一些棘手的问题。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 使用多个界定符分割字符串"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`string`对象的`split()`方法只适应于非常简单的字符串分割情形，它并不允许有多个分隔符或者是分隔符周围不确定的空格。当需要更加灵活的切割字符串的时候，最好使用`re.split()`方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "line = 'asdf fjdk; afed, fjek,asdf, foo'\n",
    "import re\n",
    "re.split(r'[;,\\s]\\s*', line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "函数`re.split()`非常实用，它允许为分隔符指定多个正则模式。只要这个模式被找到，那么匹配的分隔符两边的实体都会被当成是结果中的元素返回。返回结果为一个字段列表，这个跟`str.split()`返回值类型是一样的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fields = re.split(r'(;|,|\\s)\\s*', line)\n",
    "fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取分割字符在某些情况下也是有用的。比如，保留分割字符串，用来在后面重新构造一个新的输出字符串："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "values = fields[::2]\n",
    "delimiters = fields[1::2] + ['']\n",
    "values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[' ', ';', ',', ',', ',', '']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "delimiters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'asdf fjdk;afed,fjek,asdf,foo'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "''.join(v + d for v, d in zip(values, delimiters))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果不想保留分割字符串到结果列表中去，但仍然需要使用到括号来分组正则表达式的话，确保你的分组是非捕获分组，形如`(?:...)`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.split(r'(?:,|;|\\s)\\s*', line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 字符串开头或结尾匹配"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "检查字符串开头或结尾的一个简单方法是使用`str.startswith()`或者是`str.endswith()`方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filename = 'spam.txt'\n",
    "filename.endswith('.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filename.startswith('file:')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = 'http://www.python.org'\n",
    "url.startswith('http:')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "想检查多种匹配可能，只需要将所有的匹配项放入到一个元组中去，然后传给`startswith()`或者`endswith()`方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['.ipynb_checkpoints',\n",
       " 'AIScanner API接口使用文档.ipynb',\n",
       " 'example.txt',\n",
       " 'MovieLens 1M数据集.ipynb',\n",
       " 'movies.dat',\n",
       " 'Pyhton数据结构和算法.ipynb',\n",
       " 'Python字符串和文本.ipynb',\n",
       " 'ratings.dat',\n",
       " 'users.dat',\n",
       " '来自bit.ly的1.usa.gov数据.ipynb']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "filenames = os.listdir()\n",
    "filenames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['AIScanner API接口使用文档.ipynb',\n",
       " 'MovieLens 1M数据集.ipynb',\n",
       " 'movies.dat',\n",
       " 'Pyhton数据结构和算法.ipynb',\n",
       " 'Python字符串和文本.ipynb',\n",
       " 'ratings.dat',\n",
       " 'users.dat',\n",
       " '来自bit.ly的1.usa.gov数据.ipynb']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[name for name in filenames if name.endswith(('.ipynb', '.dat'))]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "any(name.endswith('.txt') for name in filenames)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个方法中必须要输入一个元组作为参数。如果恰巧有一个`list`或者`set`类型的选择项，要确保传递参数前先调用`tuple()`将其转换为元组类型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.request import urlopen\n",
    "\n",
    "def read_data(name):\n",
    "    if name.startswith(('http:', 'https:', 'ftp:')):\n",
    "        return urlopen(name).read()\n",
    "    else:\n",
    "        with open(name) as f:\n",
    "            return f.read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "startswith first arg must be str or a tuple of str, not list",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-14-25e4786c96e2>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[0mchoices\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;34m'http:'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'ftp:'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      2\u001b[0m \u001b[0murl\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'http://www.python.org'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0murl\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mchoices\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mTypeError\u001b[0m: startswith first arg must be str or a tuple of str, not list"
     ]
    }
   ],
   "source": [
    "choices = ['http:', 'ftp:']\n",
    "url = 'http://www.python.org'\n",
    "url.startswith(choices)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url.startswith(tuple(choices))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`startswith()`和`endswith()`方法提供了一个非常方便的方式去做字符串开头和结尾的检查。类似的操作也可以使用切片来实现，但是代码看起来没有那么优雅。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filename = 'spam.txt'\n",
    "filename[-4:] == '.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = 'http://www.python.org'\n",
    "url[:5] == 'http:' or url[:6] == 'https:' or url[:4] == 'ftp:'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "还可以使用正则表达式去实现："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 5), match='http:'>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "url = 'http://www.python.org'\n",
    "re.match('http:|https:|ftp:', url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这种方式也行得通，但是对于简单的匹配实在是有点小材大用了，本节中的方法更加简单并且运行会更快些。\n",
    "\n",
    "当和其他操作比如普通数据聚合相结合的时候`startswith()`和`endswith()`方法是很不错的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 用Shell通配符匹配字符串"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`fnmatch`模块提供了两个函数——`fnmatch()`和`fnmatchcase()`，可以用来实现这样的匹配。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from fnmatch import fnmatch, fnmatchcase\n",
    "fnmatch('foo.txt', '*.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fnmatch('foo.txt', '?oo.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fnmatch('Dat45.csv', 'Dat[0-9]*')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Dat1.csv', 'Dat2.csv']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']\n",
    "[name for name in names if fnmatch(name, 'Dat*.csv')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`fnmatch()`函数使用底层操作系统的大小写敏感规则（不同的系统是不一样的）来匹配模式。\n",
    "\n",
    "如果对这个区别很在意，可以使用`fnmatchcase()`来代替。它完全使用底层操作系统的大小写模式来匹配。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fnmatch('foo.txt', '*.TXT')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fnmatchcase('foo.txt', '*.TXT')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这两个函数在处理非文件名的字符串时候它们也是很有用的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses = [\n",
    "    '5412 N CLARK ST',\n",
    "    '1060 W ADDISON ST',\n",
    "    '1039 W GRANVILLE AVE',\n",
    "    '2122 N CLARK ST',\n",
    "    '4802 N BROADWAY',\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from fnmatch import fnmatchcase\n",
    "[addr for addr in addresses if fnmatchcase(addr, '* ST')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['5412 N CLARK ST']"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`fnmatch()`函数匹配能力介于简单的字符串方法和强大的正则表达式之间。如果在数据处理操作中只需要简单的通配符就能完成的时候，这通常是一个比较合理的方案。\n",
    "\n",
    "如果代码需要做文件名的匹配，最好使用`glob`模块。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 字符串匹配和搜索"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果想匹配的是字面字符串，那么通常只需要调用基本字符串方法就行，比如`str.find()`,`str.endswith()`,`str.startswith()`或者类似的方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = 'yeah, but no, but yeah, but no, but yeah'\n",
    "text == 'yeah'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.startswith('yeah')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.endswith('no')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.find('no')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于复杂的匹配需要使用正则表达式和`re`模块。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "yes\n"
     ]
    }
   ],
   "source": [
    "text1 = '11/27/2012'\n",
    "text2 = 'Nov 27, 2012'\n",
    "\n",
    "import re\n",
    "if re.match(r'\\d+/\\d+/\\d+', text1):\n",
    "    print('yes')\n",
    "else:\n",
    "    print('no')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "no\n"
     ]
    }
   ],
   "source": [
    "if re.match(r'\\d+/\\d+/\\d+', text2):\n",
    "    print('yes')\n",
    "else:\n",
    "    print('no')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果想使用同一个模式去做多次匹配，应该先将模式字符串预编译为模式对象。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "yes\n"
     ]
    }
   ],
   "source": [
    "datepat = re.compile(r'\\d+/\\d+/\\d+')\n",
    "if datepat.match(text1):\n",
    "    print('yes')\n",
    "else:\n",
    "    print('no')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "no\n"
     ]
    }
   ],
   "source": [
    "if datepat.match(text2):\n",
    "    print('yes')\n",
    "else:\n",
    "    print('no')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`match()`总是从字符串开始去匹配，如果想查找字符串的模式出现在任意位置，使用`findall()`方法来代替。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['11/27/2012', '3/13/2013']"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'\n",
    "datepat.findall(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在定义正则式的时候，通常会利用括号去捕获分组。\n",
    "\n",
    "捕获分组可以使得后面的处理更加简单，因为可以分别将每个组的内容提取出来。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "datepat = re.compile(r'(\\d+)/(\\d+)/(\\d+)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m = datepat.match('11/27/2012')\n",
    "m"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'11/27/2012'"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'11'"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'27'"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'2012'"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('11', '27', '2012')"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.groups()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Today is 11/27/2012. PyCon starts 3/13/2013.'"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "month, day, year = m.groups()\n",
    "text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('11', '27', '2012'), ('3', '13', '2013')]"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "datepat.findall(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2012-11-27\n",
      "2013-3-13\n"
     ]
    }
   ],
   "source": [
    "for month, day, year in datepat.findall(text):\n",
    "    print('{}-{}-{}'.format(year, month, day))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`findall()`方法会搜索文本并以列表形式返回所有的匹配。如果想以迭代方式返回匹配，可以使用`finditer()`方法来代替。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('11', '27', '2012')\n",
      "('3', '13', '2013')\n"
     ]
    }
   ],
   "source": [
    "for m in datepat.finditer(text):\n",
    "    print(m.groups())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用`re`模块进行匹配和搜索文本的核心步骤就是先使用`re.compile()`编译正则表达式字符串，然后使用`match()`,`findall()`或者`finditer()`等方法。\n",
    "\n",
    "当写正则式字符串的时候，相对普遍的做法是使用原始字符串比如`r'(\\d+)/(\\d+)/(\\d+)'`。这种字符串将不去解析反斜杠。如果不这样做，必须使用两个反斜杠，类似`'(\\\\d+)/(\\\\d+)/(\\\\d+)'`。\n",
    "\n",
    "需要注意的是`match()`方法仅仅检查字符串的开始部分。它的匹配结果有可能并不是期望的那样。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m = datepat.match('11/27/2012abcdef')\n",
    "m"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "m.group()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果想精确匹配，确保正则表达式以`$`结尾："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "datepat = re.compile(r'(\\d+)/(\\d+)/(\\d+)$')\n",
    "datepat.match('11/27/2012abcdef')\n",
    "datepat.match('11/27/2012')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果仅仅是做一次简单的文本匹配/搜索操作的话，可以略过编译部分，直接使用`re`模块的函数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('11', '27', '2012'), ('3', '13', '2013')]"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(r'(\\d+)/(\\d+)/(\\d+)', text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "需要注意的是，如果打算做大量的匹配和搜索操作的话，最好先编译正则表达式，然后再重复使用它。模块级别的函数会将最近编译过的模式缓存起来，因此并不会消耗太多的性能，但是如果使用预编译模式的话，将会减少查找和一些额外的处理损耗。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.5 字符串搜索和替换"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于简单的字面模式，直接使用`str.repalce()`方法即可："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'yep, but no, but yep, but no, but yep'"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = 'yeah, but no, but yeah, but no, but yeah'\n",
    "text.replace('yeah', 'yep')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于复杂的模式，使用`re`模块中的`sub()`函数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Today is 2012-11-27. PyCon starts 2013-3-13.'"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'\n",
    "import re\n",
    "re.sub(r'(\\d+)/(\\d+)/(\\d+)', r'\\3-\\1-\\2', text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`sub()`函数中的第一个参数是被匹配的模式，第二个参数是替换模式。反斜杠数字比如`\\3`指向前面模式的捕获组号。\n",
    "\n",
    "如果打算用相同的模式做多次替换，考虑先编译它来提升性能。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Today is 2012-11-27. PyCon starts 2013-3-13.'"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "datepat = re.compile(r'(\\d+)/(\\d+)/(\\d+)')\n",
    "datepat.sub(r'\\3-\\1-\\2', text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于更加复杂的替换，可以传递一个替换回调函数来代替："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from calendar import month_abbr\n",
    "def change_date(m):\n",
    "    mon_name = month_abbr[int(m.group(1))]\n",
    "    return '{} {} {}'.format(m.group(2), mon_name, m.group(3))\n",
    "\n",
    "datepat.sub(change_date, text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "替换回调函数的参数是一个`match`对象，也就是`match()`或者`find()`返回的对象。使用`group()`方法来提取特定的匹配部分。回调函数最后返回替换字符串。\n",
    "\n",
    "如果除了替换后的结果，还想知道有多少替换发生，可以使用`re.subn()`来代替。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Today is 2012-11-27. PyCon starts 2013-3-13.'"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newtext, n = datepat.subn(r'\\3-\\1-\\2', text)\n",
    "newtext"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.6 字符串忽略大小写的搜索替换"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在文本操作时忽略大小写，需要在使用`re`模块的时候给这些操作提供`re.IGNORECASE`参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['PYTHON', 'python', 'Python']"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = 'UPPER PYTHON, lower python, Mixed Python'\n",
    "re.findall('python', text, flags = re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'UPPER snake, lower snake, Mixed snake'"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.sub('python', 'snake', text, flags = re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后那个例子揭示了一个小缺陷，替换字符串并不会自动跟被匹配字符串的大小写保持一致。为了修复这个，可能需要一个辅助函数，就像下面的这样："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "def matchcase(word):\n",
    "    def replace(m):\n",
    "        text = m.group()\n",
    "        if text.isupper():\n",
    "            return word.upper()\n",
    "        elif text.islower():\n",
    "            return word.lower()\n",
    "        elif text[0].isupper():\n",
    "            return word.capitalize()\n",
    "        else:\n",
    "            return word\n",
    "    return replace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'UPPER SNAKE, lower snake, Mixed Snake'"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.sub('python', matchcase('snake'), text, flags = re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`matchcase('snake')`返回了一个回调函数（参数必须是`match`对象），前一节提到过，`sub()`函数除了接受替换字符串外，还能接受一个回调函数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.7 最短匹配模式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在正则表达式中`*`操作符是贪婪的，因此匹配操作会查找最长的可能匹配。\n",
    "\n",
    "可以在模式中的`*`操作符后面加上`?`修饰符，这样就使得匹配变成非贪婪模式，从而得到最短的匹配。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['no.']"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str_pat = re.compile(r'\\\"(.*)\\\"')\n",
    "text1 = 'Computer says \"no.\"'\n",
    "str_pat.findall(text1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['no.\" Phone says \"yes.']"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text2 = 'Computer says \"no.\" Phone says \"yes.\"'\n",
    "str_pat.findall(text2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['no.', 'yes.']"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str_pat = re.compile(r'\\\"(.*?)\\\"')\n",
    "str_pat.findall(text2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.8 多行匹配模式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在当用点`.`去匹配任意字符的时候，忘记了点`.`不能匹配换行符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[' this is a comment ']"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "comment = re.compile(r'/\\*(.*?)\\*/')\n",
    "text1 = '/* this is a comment */'\n",
    "text2 = '''/* this is a\n",
    "multiline comment */\n",
    "'''\n",
    "comment.findall(text1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "comment.findall(text2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以修改模式字符串，增加对换行的支持。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[' this is a\\nmultiline comment ']"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "comment = re.compile(r'/\\*((?:.|\\n)*?)\\*/')\n",
    "comment.findall(text2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这个模式中，`(?:.|\\n)`指定了一个非捕获组（也就是它定义了一个仅仅用来做匹配，而不能通过单独捕获或者编号的组）。\n",
    "\n",
    "`re.compile()`函数接受一个参数叫`re.DOTALL`，它可以让正则表达式中的点`.`匹配包括换行符在内的任意字符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[' this is a\\nmultiline comment ']"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "comment = re.compile(r'/\\*(.*?)\\*/', re.DOTALL)\n",
    "comment.findall(text2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于简单的情况使用`re.DOTALL`参数工作的很好，但是如果模式非常复杂或者是为了构造字符串令牌而将多个模式合并起来，这时候使用这个标记参数就可能出现一些问题。如果让你选择的话，最好还是定义自己的正则表达式模式，这样它可以在不需要额外的标记参数下也能工作的很好。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.9 将Unicode文本标准化"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在`Unicode`中，某些字符能够用多个合法的编码表示。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Spicy Jalapeño'"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s1 = 'Spicy Jalape\\u00f1o'\n",
    "s2 = 'Spicy Jalapen\\u0303o'\n",
    "s1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Spicy Jalapeño'"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s1 == s2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(s1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "15"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(s2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里的文本”Spicy Jalapeño”使用了两种形式来表示。第一种使用整体字符”ñ”（U+00F1），第二种使用拉丁字母”n”后面跟一个”~”的组合字符（U+0303）。\n",
    "\n",
    "在需要比较字符串的程序中使用字符的多种表示会产生问题。为了修正这个问题，可以使用`unicodedata`模块先将文本标准化："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import unicodedata\n",
    "t1 = unicodedata.normalize('NFC', s1)\n",
    "t2 = unicodedata.normalize('NFC', s2)\n",
    "t1 == t2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'Spicy Jalape\\xf1o'\n"
     ]
    }
   ],
   "source": [
    "print(ascii(t1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t3 = unicodedata.normalize('NFD', s1)\n",
    "t4 = unicodedata.normalize('NFD', s2)\n",
    "t3 == t4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'Spicy Jalapen\\u0303o'\n"
     ]
    }
   ],
   "source": [
    "print(ascii(t3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`normalize()`第一个参数指定字符串标准化的方式。NFC表示字符应该是整体组成（比如可能的话就使用单一编码），而NFD表示字符应该分解为多个组合字符表示。\n",
    "\n",
    "Python同样支持扩展的标准化形式NFKC和NFKD，它们在处理某些字符的时候增加了额外的兼容特性。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ﬁ'"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = '\\ufb01'\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ﬁ'"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unicodedata.normalize('NFD', s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'fi'"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unicodedata.normalize('NFKD', s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'fi'"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unicodedata.normalize('NFKC', s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "标准化对于任何需要以一致的方式处理Unicode文本的程序都是非常重要的。当处理来自用户输入的字符串而你很难去控制编码的时候尤其如此。\n",
    "\n",
    "在清理和过滤文本的时候字符的标准化也是很重要的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Spicy Jalapeno'"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t1 = unicodedata.normalize('NFD', s1)\n",
    "''.join(c for c in t1 if not unicodedata.combining(c))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后一个例子展示了`unicodedata`模块的另一个重要方面，也就是测试字符类的工具函数。`combining()`函数可以测试一个字符是否为和音字符。在这个模块中还有其他函数用于查找字符类别，测试是否为数字字符等等。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.10 在正则式中使用Unicode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "默认情况下`re`模块已经对一些Unicode字符类有了基本的支持。比如，`\\\\d`已经匹配任意的unicode数字字符了："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 3), match='123'>"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "num = re.compile('\\d+')\n",
    "num.match('123')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 3), match='١٢٣'>"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "num.match('\\u0661\\u0662\\u0663')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果想在模式中包含指定的Unicode字符，可以使用Unicode字符对应的转义序列（比如`\\uFFF`或者`\\UFFFFFFF`）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [],
   "source": [
    "arabic = re.compile('[\\u0600-\\u06ff\\u0750-\\u077f\\u08a0-\\u08ff]+')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当执行匹配和搜索操作的时候，最好是先标准化并且清理所有文本为标准化格式。但是同样也应该注意一些特殊情况，比如在忽略大小写匹配和大小写转换时的行为。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 6), match='straße'>"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pat = re.compile('stra\\u00dfe', re.IGNORECASE)\n",
    "s = 'straße'\n",
    "pat.match(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'STRASSE'"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pat.match(s.upper())\n",
    "s.upper()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "混合使用Unicode和正则表达式通常会让你抓狂。如果你真的打算这样做的话，最好考虑下安装第三方正则式库，它们会为Unicode的大小写转换和其他大量有趣特性提供全面的支持，包括模糊匹配。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.11 删除字符串中不需要的字符"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "想去掉文本字符串开头，结尾或者中间不想要的字符，比如空白。\n",
    "\n",
    "`strip()`方法能用于删除开始或结尾的字符。`lstrip()`和`rstrip()`分别从左和从右执行删除操作。默认情况下，这些方法会去除空白字符，但是也可以指定其他字符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'hello world'"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = ' hello world \\n'\n",
    "s.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'hello world \\n'"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.lstrip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' hello world'"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.rstrip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'hello====='"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t = '-----hello====='\n",
    "t.lstrip('-')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'hello'"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t.strip('-=')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`strip()`方法在读取和清理数据以备后续处理的时候是经常会被用到的。可以用它们来去掉空格，引号和完成其他任务。\n",
    "\n",
    "但是需要注意去除操作不会对字符串的中间的文本产生任何影响。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'hello     world'"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = ' hello     world \\n'\n",
    "s = s.strip()\n",
    "s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果想处理中间的空格，那么需要使用`replace()`方法或者是用正则表达式替换。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'helloworld'"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.replace(' ', '')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'hello world'"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "re.sub('\\s+', ' ', s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将字符串`strip`操作和其他迭代操作相结合，比如从文件中读取多行数据，那么生成器表达式就可以大显身手。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(filename) as f:\n",
    "    lines = (line.strip() for line in f)\n",
    "    for line in lines:\n",
    "        print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "表达式`lines = (line.strip() for line in f)`执行数据转换操作。这种方式非常高效，它不需要预先读取所有数据放到一个临时的列表中，仅仅只创建一个生成器，并且每次返回行之前会先执行`strip`操作。\n",
    "\n",
    "对于更高阶的strip，你可能需要使用`translate()`方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.12 审查清理文本字符串"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一般情况下会选择使用字符串函数（比如`str.upper()`和`str.lower()`）将文本转为标准格式。使用`str.replace()`或者`re.sub()`的简单替换操作能删除或者改变指定的字符序列。同样还可以使用`unicodedata.normalize()`函数将unicode文本标准化。\n",
    "\n",
    "为了消除整个区间上的字符或者去除变音符，可以使用经常被忽视的`str.translate()`方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'pýtĥöñ\\x0cis\\tawesome\\r\\n'"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = 'pýtĥöñ\\fis\\tawesome\\r\\n'\n",
    "s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "第一步是清理空白字符。先创建一个小的转换表格然后使用`translate()`方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'pýtĥöñ is awesome\\n'"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "remap = {\n",
    "    ord('\\t'): ' ',\n",
    "    ord('\\f'): ' ',\n",
    "    ord('\\r'): None\n",
    "}\n",
    "a = s.translate(remap)\n",
    "a"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "空白字符`\\t`和`\\f`已经被重新映射到一个空格。回车字符`\\r`直接被删除。\n",
    "\n",
    "可以以这个表格为基础进一步构建更大的表格。比如，删除所有的和音符："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'pýtĥöñ is awesome\\n'"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import unicodedata\n",
    "import sys\n",
    "cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c)))\n",
    "b = unicodedata.normalize('NFD', a)\n",
    "b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'python is awesome\\n'"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b.translate(cmb_chrs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过使用`dict.fromkeys()`方法构造一个字典，每个Unicode和音符作为键，对于的值全部为`None`。\n",
    "\n",
    "然后使用`unicodedata.normalize()`将原始输入标准化为分解形式字符。然后再调用`translate`函数删除所有重音符。也可以被用来删除其他类型的字符（比如控制字符等）。\n",
    "\n",
    "构造一个将所有Unicode数字字符映射到对应的ASCII字符上的表格："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "580"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "digitmap = { c: ord('0') + unicodedata.digit(chr(c)) for c in range(sys.maxunicode) if unicodedata.category(chr(c)) == 'Nd' }\n",
    "len(digitmap)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'123'"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = '\\u0661\\u0662\\u0663'\n",
    "x.translate(digitmap)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "另一种清理文本的技术涉及到I/O解码与编码函数。先对文本做一些初步的清理，然后再结合`encode()`或者`decode()`操作来清除或修改它。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'pýtĥöñ is awesome\\n'"
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'python is awesome\\n'"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b = unicodedata.normalize('NFD', a)\n",
    "b.encode('ascii', 'ignore').decode('ascii')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里的标准化操作将原来的文本分解为单独的和音符。接下来的ASCII编码/解码只是简单的一下子丢弃掉那些字符。当然，这种方法仅仅只在最后的目标就是获取到文本对应ACSII表示的时候生效。\n",
    "\n",
    "文本字符清理最主要的问题应该是运行的性能。一般来讲，代码越简单运行越快。对于简单的替换操作，`str.replace()`方法通常是最快的，甚至在需要多次调用的时候。比如，为了清理空白字符，可以这样做："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_spaces(s):\n",
    "    s = s.replace('\\r', '')\n",
    "    s = s.replace('\\t', ' ')\n",
    "    s = s.replace('\\f', ' ')\n",
    "    return s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "测试发现这种方式会比使用`translate()`或者正则表达式要快很多。\n",
    "\n",
    "当需要执行任何复杂字符对字符的重新映射或者删除操作时，`tanslate()`方法会非常的快。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.13 字符串对齐"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于基本的字符串对齐操作，可以使用字符串的`ljust()`,`rjust()`和`center()`方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Hello world         '"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = 'Hello world'\n",
    "text.ljust(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'         Hello world'"
      ]
     },
     "execution_count": 105,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.rjust(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'    Hello world     '"
      ]
     },
     "execution_count": 106,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.center(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这些方法都接受一个可选的填充字符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'=========Hello world'"
      ]
     },
     "execution_count": 107,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.rjust(20, '=')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'****Hello world*****'"
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.center(20, '*')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "函数`format()`同样可以用来对齐字符串。要做的就是使用`<`,`>`或者`^`字符后面紧跟一个指定的宽度。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'         Hello world'"
      ]
     },
     "execution_count": 109,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "format(text, '>20')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Hello world         '"
      ]
     },
     "execution_count": 110,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "format(text, '<20')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'    Hello world     '"
      ]
     },
     "execution_count": 111,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "format(text, '^20')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果想指定一个非空格的填充字符，将它写到对齐字符的前面即可："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'=========Hello world'"
      ]
     },
     "execution_count": 112,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "format(text, '=>20s')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'****Hello world*****'"
      ]
     },
     "execution_count": 113,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "format(text, '*^20s')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当格式化多个值的时候，这些格式代码也可以被用在`format()`方法中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'     Hello      World'"
      ]
     },
     "execution_count": 114,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'{:>10s} {:>10s}'.format('Hello', 'World')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "format()函数不仅适用于字符串，也可用来格式化任何值，使得它非常的通用。比如，可以用它来格式化数字："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'    1.2345'"
      ]
     },
     "execution_count": 115,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = 1.2345\n",
    "format(x, '>10')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'   1.23   '"
      ]
     },
     "execution_count": 116,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "format(x, '^10.2f')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在老的代码中，经常会看到被用来格式化文本的`%`操作符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Hello world         '"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'%-20s' % text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'         Hello world'"
      ]
     },
     "execution_count": 118,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'%20s' % text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "但是，在新版本代码中，应该优先选择`format()`函数。`format()`要比`%`操作符的功能更为强大。并且`format()`也比`ljust()`,`rjust()`或`center()`方法更通用，因为它可以用来格式化任意对象，而不仅仅是字符串。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.14 合并拼接字符串"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "想要合并的字符串是在一个序列或者`iterable`中，那么最快的方式就是使用`join()`方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Is Chicago Not Chicago?'"
      ]
     },
     "execution_count": 119,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parts = ['Is', 'Chicago', 'Not', 'Chicago?']\n",
    "' '.join(parts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Is,Chicago,Not,Chicago?'"
      ]
     },
     "execution_count": 120,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "','.join(parts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'IsChicagoNotChicago?'"
      ]
     },
     "execution_count": 121,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "''.join(parts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这种语法看上去比较怪，但是`join()`被指定为字符串的一个方法。这样做的原因是去连接的对象可能来自各种不同的数据序列（比如列表，元组，字典，文件，集合或生成器等），如果在所有这些对象上都定义一个`join()`方法明显是冗余的。因此只需要指定想要的分割字符串并调用他的`join()`方法去将文本片段组合起来。\n",
    "\n",
    "仅仅只是合并少数几个字符串，使用加号`+`通常已经足够了："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Is Chicago Not Chicago?'"
      ]
     },
     "execution_count": 122,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = 'Is Chicago'\n",
    "b = 'Not Chicago?'\n",
    "a + ' ' + b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "加号`+`操作符也可作为一些复杂字符串格式化的替代方案，比如："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Is Chicago Not Chicago?\n"
     ]
    }
   ],
   "source": [
    "print('{} {}'.format(a, b))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Is Chicago Not Chicago?\n"
     ]
    }
   ],
   "source": [
    "print(a + ' ' + b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在源码中将两个字面字符串合并起来，只需要简单的将它们放到一起，不需要用加号`+`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'HelloWorld'"
      ]
     },
     "execution_count": 125,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = 'Hello' 'World'\n",
    "a"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用加号`+`操作符去连接大量的字符串的时候是非常低效率的，因为加号连接会引起内存复制以及垃圾回收操作。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 126,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = ''\n",
    "for p in parts:\n",
    "    s += p"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这种写法会比使用`join()`方法运行的要慢一些，因为每一次执行`+=`操作的时候会创建一个新的字符串对象。最好是先收集所有的字符串片段然后再将它们连接起来。\n",
    "\n",
    "一个相对比较聪明的技巧是利用生成器表达式转换数据为字符串的同时合并字符串，比如："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ABCD,50,91.1'"
      ]
     },
     "execution_count": 127,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = ['ABCD', 50, 91.1]\n",
    "','.join(str(d) for d in data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "同样还得注意不必要的字符串连接操作。有时候程序员在没有必要做连接操作的时候仍然多此一举。比如在打印的时候："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "HelloWorld:Not Chicago?\n"
     ]
    }
   ],
   "source": [
    "print(a + ':' + b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "HelloWorld;Not Chicago?\n"
     ]
    }
   ],
   "source": [
    "print(';'.join([a, b]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "HelloWorld:Not Chicago?\n"
     ]
    }
   ],
   "source": [
    "print(a ,b, sep = ':')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当混合使用I/O操作和字符串连接操作的时候，有时候需要仔细研究程序。比如，考虑下面的两端代码片段："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f.write(chunk1 + chunk2)\n",
    "f.write(chunk1)\n",
    "f.write(chunk2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果两个字符串很小，那么第一个版本性能会更好些，因为I/O系统调用天生就慢。另外一方面，如果两个字符串很大，那么第二个版本可能会更加高效，因为它避免了创建一个很大的临时结果并且要复制大量的内存块数据。\n",
    "\n",
    "如果准备编写构建大量小字符串的输出代码，最好考虑下使用生成器函数，利用yield语句产生输出片段。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample():\n",
    "    yield 'Is'\n",
    "    yield 'Chicago'\n",
    "    yield 'Not'\n",
    "    yield 'Chicago?'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这种方法一个有趣的方面是它并没有对输出片段到底要怎样组织做出假设。例如，可以简单的使用`join()`方法将这些片段合并起来："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = ''.join(sample())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "或者也可以将字符串片段重定向到I/O："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for part in sample():\n",
    "    f.write(part)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "再或者还可以写出一些结合I/O操作的混合方案："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [],
   "source": [
    "def combine(source, maxsize):\n",
    "    parts = []\n",
    "    size = 0\n",
    "    for part in source:\n",
    "        parts.append(part)\n",
    "        size += len(part)\n",
    "        if size > maxsize:\n",
    "            yield ''.join(parts)\n",
    "            parts = []\n",
    "            size = 0\n",
    "        yield ''.join(parts)\n",
    "        \n",
    "with open('filename', 'w') as f:\n",
    "    for part in combine(sample(), 32768):\n",
    "        f.write(part)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "原始的生成器函数并不需要知道使用细节，它只负责生成字符串片段就行了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.15 字符串中插入变量"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Python并没有对在字符串中简单替换变量值提供直接的支持。但是通过使用字符串的`format()`方法来解决这个问题。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Guido has 37 messages.'"
      ]
     },
     "execution_count": 135,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = '{name} has {n} messages.'\n",
    "s.format(name = 'Guido', n = 37)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "要被替换的变量能在变量域中找到，可以结合使用`format_map()`和`vars()`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Guido has 37 messages.'"
      ]
     },
     "execution_count": 136,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "name = 'Guido'\n",
    "n = 37\n",
    "s.format_map(vars())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`vars()`也适用于对象实例。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 137,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Guido has 37 messages.'"
      ]
     },
     "execution_count": 137,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "class Info:\n",
    "    def __init__(self, name, n):\n",
    "        self.name = name\n",
    "        self.n = n\n",
    "        \n",
    "a = Info('Guido', 37)\n",
    "s.format_map(vars(a))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`format`和`format_map()`并不能很好的处理变量缺失的情况，比如："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {},
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "'n'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-138-92f0f04854f3>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0ms\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mname\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'Guido'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mKeyError\u001b[0m: 'n'"
     ]
    }
   ],
   "source": [
    "s.format(name = 'Guido')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "避免这种错误的方法是另外定义一个含有`__missing__()`方法的字典对象，就像下面这样："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 139,
   "metadata": {},
   "outputs": [],
   "source": [
    "class safesub(dict):\n",
    "    def __missing__(self, key):\n",
    "        return '{' + key + '}'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在可以利用这个类包装输入后传递给`format_map()`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Guido has {n} messages.'"
      ]
     },
     "execution_count": 140,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "del n\n",
    "s.format_map(safesub(vars()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果在代码中频繁的执行这些步骤，可以将变量替换步骤用一个工具函数封装起来。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 141,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "def sub(text):\n",
    "    return text.format_map(safesub(sys._getframe(1).f_locals))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello Guido\n"
     ]
    }
   ],
   "source": [
    "name = 'Guido'\n",
    "n = 37\n",
    "print(sub('Hello {name}'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "You have 37 messages.\n"
     ]
    }
   ],
   "source": [
    "print(sub('You have {n} messages.'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "You favorite color is {color}\n"
     ]
    }
   ],
   "source": [
    "print(sub('You favorite color is {color}'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "多年以来由于Python缺乏对变量替换的内置支持而导致了各种不同的解决方案。有时候会看到像下面这样的字符串格式化代码："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "unsupported format character 'm' (0x6d) at index 17",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-146-6f8d6f7d4c5b>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[0mname\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'Guido'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      2\u001b[0m \u001b[0mn\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;36m37\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[1;34m'%(name) has %(n) messages.'\u001b[0m \u001b[1;33m%\u001b[0m \u001b[0mvars\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m: unsupported format character 'm' (0x6d) at index 17"
     ]
    }
   ],
   "source": [
    "name = 'Guido'\n",
    "n = 37\n",
    "'%(name) has %(n) messages.' % vars()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可能还会看到字符串模板的使用："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Guido has 37 messages.'"
      ]
     },
     "execution_count": 147,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import string\n",
    "s = string.Template('$name has $n messages.')\n",
    "s.substitute(vars())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`format()`和`format_map()`比上面这些方案更加先进，应该被优先选择。使用`format()`方法还有一个好处就是可以获得对字符串格式化的所有支持（对齐，填充，数字格式化等待），而这些特性是使用像模板字符串之类的方案不可能获得的。\n",
    "\n",
    "映射或者字典类中鲜为人知的`__missing__()`方法可以定义如何处理缺失的值。在`SafeSub`类中，这个方法被定义为对缺失的值返回一个占位符。缺失的值会出现在结果字符串中（在调试的时候可能很有用），而不是产生一个`KeyError`异常。\n",
    "\n",
    "`sub()`函数使用`sys._getframe(1)`返回调用者的栈帧。访问属性`f_locals`来获得局部变量。绝大部分情况下在代码中去直接操作栈帧是不推荐的。但是，对于像字符串替换工具函数而言它是非常有用的。值得注意的是`f_locals`是一个复制调用函数的本地变量的字典。尽管可以改变`f_locals`的内容，但是这个修改对于后面的变量访问没有任何影响。所以，虽说访问一个栈帧看上去很邪恶，但是对它的任何操作不会覆盖和改变调用者本地变量的值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.16 以指定列宽格式化字符串"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用`textwrap`模块来格式化字符串的输出。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,\n",
      "not around the eyes, don't look around the eyes, look into my eyes,\n",
      "you're under.\n"
     ]
    }
   ],
   "source": [
    "s = \"Look into my eyes, look into my eyes, the eyes, the eyes, \\\n",
    "the eyes, not around the eyes, don't look around the eyes, \\\n",
    "look into my eyes, you're under.\"\n",
    "import textwrap\n",
    "print(textwrap.fill(s, 70))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Look into my eyes, look into my eyes,\n",
      "the eyes, the eyes, the eyes, not around\n",
      "the eyes, don't look around the eyes,\n",
      "look into my eyes, you're under.\n"
     ]
    }
   ],
   "source": [
    "print(textwrap.fill(s, 40))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    Look into my eyes, look into my\n",
      "eyes, the eyes, the eyes, the eyes, not\n",
      "around the eyes, don't look around the\n",
      "eyes, look into my eyes, you're under.\n"
     ]
    }
   ],
   "source": [
    "print(textwrap.fill(s, 40, initial_indent = '    '))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Look into my eyes, look into my eyes,\n",
      "    the eyes, the eyes, the eyes, not\n",
      "    around the eyes, don't look around\n",
      "    the eyes, look into my eyes, you're\n",
      "    under.\n"
     ]
    }
   ],
   "source": [
    "print(textwrap.fill(s, 40, subsequent_indent = '    '))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`fill()`方法接受一些其他可选参数来控制tab、语句结尾等。\n",
    "\n",
    "`textwrap`模块对于字符串打印是非常有用的，特别是当希望输出自动匹配终端大小的时候。可以使用`os.get_terminal_size()`方法来获取终端的大小尺寸。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100"
      ]
     },
     "execution_count": 152,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "os.get_terminal_size().columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.17 在字符串中处理html和xml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "替换文本字符串中的`<`或者`>`，使用`html.escape()`函数可以很容易的完成。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Elements are written as \"<tag>text</tag>\".\n"
     ]
    }
   ],
   "source": [
    "s = 'Elements are written as \"<tag>text</tag>\".'\n",
    "import html\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.\n"
     ]
    }
   ],
   "source": [
    "print(html.escape(s))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Elements are written as \"&lt;tag&gt;text&lt;/tag&gt;\".\n"
     ]
    }
   ],
   "source": [
    "print(html.escape(s, quote = False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果正在处理的是ASCII文本，并且想将非ASCII文本对应的编码实体嵌入进去，可以给某些I/O函数传递参数`errors='xmlcharrefreplace'`来达到这个目的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "b'Spicy Jalape&#241;o'"
      ]
     },
     "execution_count": 156,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = 'Spicy Jalapeño'\n",
    "s.encode('ascii', errors = 'xmlcharrefreplace')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果正在处理HTML或者XML文本，试着先使用一个合适的HTML或者XML解析器。通常情况下，这些工具会自动替换这些编码值。\n",
    "\n",
    "如果接收到了一些含有编码值的原始文本，需要手动去做替换，通常只需要使用HTML或者XML解析器的一些相关工具函数/方法即可。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\python36\\lib\\site-packages\\ipykernel_launcher.py:4: DeprecationWarning: The unescape method is deprecated and will be removed in 3.5, use html.unescape() instead.\n",
      "  after removing the cwd from sys.path.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'Spicy \"Jalapeño\".'"
      ]
     },
     "execution_count": 157,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = 'Spicy &quot;Jalape&#241;o&quot.'\n",
    "from html.parser import HTMLParser\n",
    "p = HTMLParser()\n",
    "p.unescape(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The prompt is >>>'"
      ]
     },
     "execution_count": 158,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t = 'The prompt is &gt;&gt;&gt;'\n",
    "from xml.sax.saxutils import unescape\n",
    "unescape(t)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在生成HTML或者XML文本的时候，正确的转换特殊标记字符是很容易被忽视的细节。特别是当使用`print()`函数或者其他字符串格式化来产生输出的时候。使用像`html.escape()`的工具函数可以很容易的解决这类问题。\n",
    "\n",
    "如果想以其他方式处理文本，还有一些函数比如`xml.sax.saxutils.unescapge()`可以帮助你。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.18 字符串令牌解析"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了令牌化字符串，不仅需要匹配模式，还得指定模式的类型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 159,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'foo = 23 + 42 * 10'\n",
    "tokens = [('NAME', 'foo'), ('EQ', '='), ('NUM', '23'), ('PLUS', '+'),\n",
    "          ('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了执行这样的切分，第一步就是利用命名捕获组的正则表达式来定义所有可能的令牌，包括空格："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "NAME = r'(?P<NAME>[a-zA-Z][a-zA-Z_0-9]*)'\n",
    "NUM = r'(?P<NUM>\\d+)'\n",
    "PLUS = r'(?P<PLUS>\\+)'\n",
    "TIMES = r'(?P<TIMES>\\*)'\n",
    "EQ = r'(?P<EQ>=)'\n",
    "WS = r'(?P<WS>\\s+)'\n",
    "\n",
    "master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在上面的模式中，`?P<TOKENNAME>`用于给一个模式命名，供后面使用。\n",
    "\n",
    "下一步，为了令牌化，使用`scanner()`方法。这个方法会创建一个`scanner`对象，在这个对象上不断的调用`match()`方法会一步步的扫描目标文本，每步一个匹配。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 161,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 3), match='foo'>"
      ]
     },
     "execution_count": 161,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scanner = master_pat.scanner('foo = 42')\n",
    "scanner.match()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 162,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('NAME', 'foo')"
      ]
     },
     "execution_count": 162,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "_.lastgroup, _.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 163,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(3, 4), match=' '>"
      ]
     },
     "execution_count": 163,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scanner.match()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 164,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('WS', ' ')"
      ]
     },
     "execution_count": 164,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "_.lastgroup, _.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 165,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(4, 5), match='='>"
      ]
     },
     "execution_count": 165,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scanner.match()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 166,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('EQ', '=')"
      ]
     },
     "execution_count": 166,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "_.lastgroup, _.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 167,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(5, 6), match=' '>"
      ]
     },
     "execution_count": 167,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scanner.match()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 168,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('WS', ' ')"
      ]
     },
     "execution_count": 168,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "_.lastgroup, _.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 169,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(6, 8), match='42'>"
      ]
     },
     "execution_count": 169,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scanner.match()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 170,
   "metadata": {},
   "outputs": [],
   "source": [
    "scanner.match()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "实际使用的时候，可以很容易的将上述代码打包到一个生成器中："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_token(pat, text):\n",
    "    Token = namedtuple('Token', ['type', 'value'])\n",
    "    scanner = pat.scanner(text)\n",
    "    for m in iter(scanner.match, None):\n",
    "        yield Token(m.lastgroup, m.group())\n",
    "        \n",
    "for tok in generate_token(master_pat, 'foo = 42'):\n",
    "    print(tok)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果想过滤令牌流，可以定义更多的生成器函数或者使用一个生成器表达式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokens = (tok for tok in generate_token(master_pat, text) if tok.type != 'WS')\n",
    "for tok in tokens:\n",
    "    print(tok)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通常来讲令牌化是很多高级文本解析与处理的第一步。必须确认使用正则表达式指定了所有输入中可能出现的文本序列。如果有任何不可匹配的文本出现了，扫描就会直接停止。\n",
    "\n",
    "令牌的顺序也是有影响的。`re`模块会按照指定好的顺序去做匹配。因此，如果一个模式恰好是另一个更长模式的子字符串，那么需要将长模式写在前面。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "LT = r'(?P<LT><)'\n",
    "LE = r'(?P<LE><=)'\n",
    "EQ = r'(?P<EQ>=)'\n",
    "\n",
    "master_pat = re.compile('|'.join([LE, LT, EQ]))\n",
    "# master_pat = re.compile('|'.join([LT, LE, EQ]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "第二个模式是错的，因为它会将文本`<=`匹配为令牌`LT`紧跟着`EQ`，而不是单独的令牌`LE`，这个并不是我们想要的结果。\n",
    "\n",
    "最后，需要留意下子字符串形式的模式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PRINT = r'(P<PRINT>print)'\n",
    "NAME = r'(P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'\n",
    "\n",
    "master_pat = re.compile('|'.join([PRINT, NAME]))\n",
    "\n",
    "for tok in generate_token(master_pat, 'printer'):\n",
    "    print(tok)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.19 实现一个简单的递归下降分析器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "根据一组语法规则解析文本并执行命令，或者构造一个代表输入的抽象语法树。如果语法非常简单，可以自己写这个解析器，而不是使用一些框架。\n",
    "\n",
    "首先要以BNF或者EBNF形式指定一个标准语法。比如，一个简单数学表达式语法可能像下面这样："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr ::= expr + term\n",
    "    |   expr - term\n",
    "    |   term\n",
    "    \n",
    "term ::= term * factor\n",
    "    |   term / factor\n",
    "    |   factor\n",
    "    \n",
    "factor ::= ( expr )\n",
    "    |   NUM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "或者，以EBNF形式："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr ::= term { (+|-) term }*\n",
    "\n",
    "term ::= factor { (*|/) factor }*\n",
    "        \n",
    "factor ::= ( expr )\n",
    "    |   NUM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在EBNF中，被包含在`{...}*`中的规则是可选的。`*`代表0次或多次重复（跟正则表达式中意义是一样的）。\n",
    "\n",
    "现在，如果对BNF的工作机制还不是很明白的话，就把它当做是一组左右符号可相互替换的规则。一般来讲，解析的原理就是利用BNF完成多个替换和扩展以匹配输入文本和语法规则。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NUM + NUM * NUM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在此基础上，解析动作会试着去通过替换操作匹配语法到输入令牌："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr\n",
    "expr ::= term{ (+|-) term }*\n",
    "expr ::= factor{ (*|/) factor }* { (+|-) term }*\n",
    "expr ::= NUM { (*|/) factor }* { (+|-) term }*\n",
    "expr ::= NUM {(+|-)term}*\n",
    "expr ::= NUM + term { (+|-) term}*\n",
    "expr ::= NUM + factor { (*|/) factor }* { (+|-) term }*\n",
    "expr ::= NUM + NUM { (*|/) factor }* { (+|-) term }*\n",
    "expr ::= NUM + NUM * factor { (*|/) factor }* { (+|-) term }*\n",
    "expr ::= NUM + NUM * NUM { (*|/) factor }* { (+|-) term }*\n",
    "expr ::= NUM + NUM * NUM { (+|-) term }*\n",
    "expr ::= NUM + NUM * NUM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面所有的解析步骤可能需要花点时间弄明白，但是它们原理都是查找输入并试着去匹配语法规则。第一个输入令牌是NUM，因此替换首先会匹配那个部分。一旦匹配成功，就会进入下一个令牌`+`，以此类推。当已经确定不能匹配下一个令牌的时候，右边的部分（比如`{(*/)factor}*`）就会被清理掉。在一个成功的解析中，整个右边部分会完全展开来匹配输入令牌流。\n",
    "\n",
    "举例来展示如何构建一个递归下降表达式求值程序："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!/usr/bin/env python\n",
    "# -*- encoding: utf-8 -*-\n",
    "\"\"\"\n",
    "Topic: 下降解析器\n",
    "Desc :\n",
    "\"\"\"\n",
    "import re\n",
    "import collections\n",
    "\n",
    "# Token specification\n",
    "NUM = r'(?P<NUM>\\d+)'\n",
    "PLUS = r'(?P<PLUS>\\+)'\n",
    "MINUS = r'(?P<MINUS>-)'\n",
    "TIMES = r'(?P<TIMES>\\*)'\n",
    "DIVIDE = r'(?P<DIVIDE>/)'\n",
    "LPAREN = r'(?P<LPAREN>\\()'\n",
    "RPAREN = r'(?P<RPAREN>\\))'\n",
    "WS = r'(?P<WS>\\s+)'\n",
    "\n",
    "master_pat = re.compile('|'.join([NUM, PLUS, MINUS, TIMES,\n",
    "                                  DIVIDE, LPAREN, RPAREN, WS]))\n",
    "# Tokenizer\n",
    "Token = collections.namedtuple('Token', ['type', 'value'])\n",
    "\n",
    "def generate_tokens(text):\n",
    "    scanner = master_pat.scanner(text)\n",
    "    for m in iter(scanner.match, None):\n",
    "        tok = Token(m.lastgroup, m.group())\n",
    "        if tok.type != 'WS':\n",
    "            yield tok\n",
    "\n",
    "# Parser\n",
    "class ExpressionEvaluator:\n",
    "    '''\n",
    "    Implementation of a recursive descent parser. Each method\n",
    "    implements a single grammar rule. Use the ._accept() method\n",
    "    to test and accept the current lookahead token. Use the ._expect()\n",
    "    method to exactly match and discard the next token on on the input\n",
    "    (or raise a SyntaxError if it doesn't match).\n",
    "    '''\n",
    "\n",
    "    def parse(self, text):\n",
    "        self.tokens = generate_tokens(text)\n",
    "        self.tok = None  # Last symbol consumed\n",
    "        self.nexttok = None  # Next symbol tokenized\n",
    "        self._advance()  # Load first lookahead token\n",
    "        return self.expr()\n",
    "\n",
    "    def _advance(self):\n",
    "        'Advance one token ahead'\n",
    "        self.tok, self.nexttok = self.nexttok, next(self.tokens, None)\n",
    "\n",
    "    def _accept(self, toktype):\n",
    "        'Test and consume the next token if it matches toktype'\n",
    "        if self.nexttok and self.nexttok.type == toktype:\n",
    "            self._advance()\n",
    "            return True\n",
    "        else:\n",
    "            return False\n",
    "\n",
    "    def _expect(self, toktype):\n",
    "        'Consume next token if it matches toktype or raise SyntaxError'\n",
    "        if not self._accept(toktype):\n",
    "            raise SyntaxError('Expected ' + toktype)\n",
    "\n",
    "    # Grammar rules follow\n",
    "    def expr(self):\n",
    "        \"expression ::= term { ('+'|'-') term }*\"\n",
    "        exprval = self.term()\n",
    "        while self._accept('PLUS') or self._accept('MINUS'):\n",
    "            op = self.tok.type\n",
    "            right = self.term()\n",
    "            if op == 'PLUS':\n",
    "                exprval += right\n",
    "            elif op == 'MINUS':\n",
    "                exprval -= right\n",
    "        return exprval\n",
    "\n",
    "    def term(self):\n",
    "        \"term ::= factor { ('*'|'/') factor }*\"\n",
    "        termval = self.factor()\n",
    "        while self._accept('TIMES') or self._accept('DIVIDE'):\n",
    "            op = self.tok.type\n",
    "            right = self.factor()\n",
    "            if op == 'TIMES':\n",
    "                termval *= right\n",
    "            elif op == 'DIVIDE':\n",
    "                termval /= right\n",
    "        return termval\n",
    "\n",
    "    def factor(self):\n",
    "        \"factor ::= NUM | ( expr )\"\n",
    "        if self._accept('NUM'):\n",
    "            return int(self.tok.value)\n",
    "        elif self._accept('LPAREN'):\n",
    "            exprval = self.expr()\n",
    "            self._expect('RPAREN')\n",
    "            return exprval\n",
    "        else:\n",
    "            raise SyntaxError('Expected NUMBER or LPAREN')\n",
    "\n",
    "def descent_parser():\n",
    "    e = ExpressionEvaluator()\n",
    "    print(e.parse('2'))\n",
    "    print(e.parse('2 + 3'))\n",
    "    print(e.parse('2 + 3 * 4'))\n",
    "    print(e.parse('2 + (3 + 4) * 5'))\n",
    "    # print(e.parse('2 + (3 + * 4)'))\n",
    "    # Traceback (most recent call last):\n",
    "    #    File \"<stdin>\", line 1, in <module>\n",
    "    #    File \"exprparse.py\", line 40, in parse\n",
    "    #    return self.expr()\n",
    "    #    File \"exprparse.py\", line 67, in expr\n",
    "    #    right = self.term()\n",
    "    #    File \"exprparse.py\", line 77, in term\n",
    "    #    termval = self.factor()\n",
    "    #    File \"exprparse.py\", line 93, in factor\n",
    "    #    exprval = self.expr()\n",
    "    #    File \"exprparse.py\", line 67, in expr\n",
    "    #    right = self.term()\n",
    "    #    File \"exprparse.py\", line 77, in term\n",
    "    #    termval = self.factor()\n",
    "    #    File \"exprparse.py\", line 97, in factor\n",
    "    #    raise SyntaxError(\"Expected NUMBER or LPAREN\")\n",
    "    #    SyntaxError: Expected NUMBER or LPAREN\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    descent_parser()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "编写一个递归下降解析器的整体思路是比较简单的。开始的时候，先获得所有的语法规则，然后将其转换为一个函数或者方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr ::= term { ('+'|'-') term }*\n",
    "\n",
    "term ::= factor { ('*'|'/') factor }*\n",
    "        \n",
    "factor ::= '(' expr ')'\n",
    "    |   NUM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "应该首先将它们转换成一组像下面这样的方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ExpressionEvaluator:\n",
    "    ...\n",
    "    def expr(self):\n",
    "    ...\n",
    "    def term(self):\n",
    "    ...\n",
    "    def factor(self):\n",
    "    ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "每个方法要完成的任务很简单 - 它必须从左至右遍历语法规则的每一部分，处理每个令牌，要么处理完语法规则，要么产生一个语法错误。为了这样做，需采用下面的这些实现方法：\n",
    "\n",
    "* 如果规则中的下个符号是另外一个语法规则的名字（比如term或factor），就简单的调用同名的方法即可。这就是该算法中”下降”的由来 - 控制下降到另一个语法规则中去。有时候规则会调用已经执行的方法（比如，在`factor::='('expr')'`中对expr的调用）。这就是算法中”递归”的由来。\n",
    "\n",
    "* 如果规则中下一个符号是个特殊符号（比如()，得查找下一个令牌并确认是一个精确匹配）。如果不匹配，就产生一个语法错误。\n",
    "\n",
    "* 如果规则中下一个符号为一些可能的选择项（比如 + 或 - ），必须对每一种可能情况检查下一个令牌，只有当它匹配一个的时候才能继续。这也是本节示例中`_accept()`方法的目的。它相当于`_expect()`方法的弱化版本，因为如果一个匹配找到了它会继续，但是如果没找到，它不会产生错误而是回滚（允许后续的检查继续进行）。\n",
    "\n",
    "* 对于有重复部分的规则（比如在规则表达式`::=term{('+'|'-')term}*`中），重复动作通过一个while循环来实现。循环主体会收集或处理所有的重复元素直到没有其他元素可以找到。\n",
    "\n",
    "* 一旦整个语法规则处理完成，每个方法会返回某种结果给调用者。这就是在解析过程中值是怎样累加的原理。比如，在表达式求值程序中，返回值代表表达式解析后的部分结果。最后所有值会在最顶层的语法规则方法中合并起来。\n",
    "\n",
    "尽管是一个简单的例子，递归下降解析器可以用来实现非常复杂的解析。比如，Python语言本身就是通过一个递归下降解析器去解释的。如果对此感兴趣，可以通过查看Python源码文件Grammar/Grammar来研究下底层语法机制。看完会发现，通过手动方式去实现一个解析器其实会有很多的局限和不足之处。\n",
    "\n",
    "其中一个局限就是它们不能被用于包含任何左递归的语法规则中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "items ::= items ',' item\n",
    "    | item"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了这样做，可能会像下面这样使用`items()`方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def items(self):\n",
    "    itemsval = self.items()\n",
    "    if itemsval and self._accept(','):\n",
    "        itemsval.append(self.item())\n",
    "    else:\n",
    "        itemsval = [ self.item() ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "唯一的问题是这个方法根本不能工作，它会产生一个无限递归错误。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expr ::= factor{ ('+'|'-'|'*'|'/') factor }*\n",
    "\n",
    "factor ::= '(' expression ')'\n",
    "    | NUM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个语法看上去没啥问题，但是它却不能察觉到标准四则运算中的运算符优先级。\n",
    "\n",
    "对于复杂的语法，最好是选择某个解析工具比如PyParsing或者是PLY。下面是使用PLY来重写表达式求值程序的代码："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ply.lex import lex\n",
    "from ply.yacc import yacc\n",
    "\n",
    "# Token list\n",
    "tokens = [ 'NUM', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN' ]\n",
    "# Ignored characters\n",
    "t_ignore = ' \\t\\n'\n",
    "# Token specifications (as regexs)\n",
    "t_PLUS = r'\\+'\n",
    "t_MINUS = r'-'\n",
    "t_TIMES = r'\\*'\n",
    "t_DIVIDE = r'/'\n",
    "t_LPAREN = r'\\('\n",
    "t_RPAREN = r'\\)'\n",
    "\n",
    "# Token processing functions\n",
    "def t_NUM(t):\n",
    "    r'\\d+'\n",
    "    t.value = int(t.value)\n",
    "    return t\n",
    "\n",
    "# Error handler\n",
    "def t_error(t):\n",
    "    print('Bad character: {!r}'.format(t.value[0]))\n",
    "    t.skip(1)\n",
    "\n",
    "# Build the lexer\n",
    "lexer = lex()\n",
    "\n",
    "# Grammar rules and handler functions\n",
    "def p_expr(p):\n",
    "    '''\n",
    "    expr : expr PLUS term\n",
    "        | expr MINUS term\n",
    "    '''\n",
    "    if p[2] == '+':\n",
    "        p[0] = p[1] + p[3]\n",
    "    elif p[2] == '-':\n",
    "        p[0] = p[1] - p[3]\n",
    "\n",
    "def p_expr_term(p):\n",
    "    '''\n",
    "    expr : term\n",
    "    '''\n",
    "    p[0] = p[1]\n",
    "\n",
    "def p_term(p):\n",
    "    '''\n",
    "    term : term TIMES factor\n",
    "    | term DIVIDE factor\n",
    "    '''\n",
    "    if p[2] == '*':\n",
    "        p[0] = p[1] * p[3]\n",
    "    elif p[2] == '/':\n",
    "        p[0] = p[1] / p[3]\n",
    "\n",
    "def p_term_factor(p):\n",
    "    '''\n",
    "    term : factor\n",
    "    '''\n",
    "    p[0] = p[1]\n",
    "\n",
    "def p_factor(p):\n",
    "    '''\n",
    "    factor : NUM\n",
    "    '''\n",
    "    p[0] = p[1]\n",
    "\n",
    "def p_factor_group(p):\n",
    "    '''\n",
    "    factor : LPAREN expr RPAREN\n",
    "    '''\n",
    "    p[0] = p[2]\n",
    "\n",
    "def p_error(p):\n",
    "    print('Syntax error')\n",
    "\n",
    "parser = yacc()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个程序中，所有代码都位于一个比较高的层次。只需要为令牌写正则表达式和规则匹配时的高阶处理函数即可。而实际的运行解析器，接受令牌等等底层动作已经被库函数实现了。\n",
    "\n",
    "下面是一个怎样使用得到的解析对象的例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "parser.parse('2')\n",
    "parser.parse('2+3')\n",
    "parser.parse('2+(3+4)*5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.20 字节字符串上的字符串操作"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "字节字符串同样也支持大部分和文本字符串一样的内置操作。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 172,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "b'Hello'"
      ]
     },
     "execution_count": 172,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = b'Hello World'\n",
    "data[0:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 173,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 173,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.startswith(b'Hello')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 174,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[b'Hello', b'World']"
      ]
     },
     "execution_count": 174,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.split()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 175,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "b'Hello Cruel World'"
      ]
     },
     "execution_count": 175,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.replace(b'Hello', b'Hello Cruel')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这些操作同样也适用于字节数组。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 176,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "bytearray(b'Hello')"
      ]
     },
     "execution_count": 176,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = bytearray(b'Hello World')\n",
    "data[0:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 177,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 177,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.startswith(b'Hello')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 178,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[bytearray(b'Hello'), bytearray(b'World')]"
      ]
     },
     "execution_count": 178,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.split()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 179,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "bytearray(b'Hello Cruel World')"
      ]
     },
     "execution_count": 179,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.replace(b'Hello', b'Hello Cruel')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以使用正则表达式匹配字节字符串，但是正则表达式本身必须也是字节串。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 180,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "cannot use a string pattern on a bytes-like object",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-180-0602da5f83bc>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34mb'FOO:BAR,SPAM'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      2\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mre\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mre\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'[:,]'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;32mc:\\python36\\lib\\re.py\u001b[0m in \u001b[0;36msplit\u001b[1;34m(pattern, string, maxsplit, flags)\u001b[0m\n\u001b[0;32m    210\u001b[0m     \u001b[1;32mand\u001b[0m \u001b[0mthe\u001b[0m \u001b[0mremainder\u001b[0m \u001b[0mof\u001b[0m \u001b[0mthe\u001b[0m \u001b[0mstring\u001b[0m \u001b[1;32mis\u001b[0m \u001b[0mreturned\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mthe\u001b[0m \u001b[0mfinal\u001b[0m \u001b[0melement\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    211\u001b[0m     of the list.\"\"\"\n\u001b[1;32m--> 212\u001b[1;33m     \u001b[1;32mreturn\u001b[0m \u001b[0m_compile\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mpattern\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mflags\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mstring\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmaxsplit\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    213\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    214\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mfindall\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mpattern\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstring\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mflags\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mTypeError\u001b[0m: cannot use a string pattern on a bytes-like object"
     ]
    }
   ],
   "source": [
    "data = b'FOO:BAR,SPAM'\n",
    "import re\n",
    "re.split('[:,]', data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在文本字符串上的操作均可用于字节字符串。但字节字符串的索引操作返回整数而不是单独字符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 181,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'H'"
      ]
     },
     "execution_count": 181,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = 'Hello World'\n",
    "a[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 182,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'e'"
      ]
     },
     "execution_count": 182,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 183,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "72"
      ]
     },
     "execution_count": 183,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b = b'Hello World'\n",
    "b[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 184,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "101"
      ]
     },
     "execution_count": 184,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这种语义上的区别会对于处理面向字节的字符数据有影响。\n",
    "\n",
    "字节字符串不会提供一个美观的字符串表示，也不能很好的打印出来，除非它们先被解码为一个文本字符串。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 185,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "b'Hello World'\n"
     ]
    }
   ],
   "source": [
    "s = b'Hello World'\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 186,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello World\n"
     ]
    }
   ],
   "source": [
    "print(s.decode('ascii'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "类似的，也不存在任何适用于字节字符串的格式化操作："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 187,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "b'      ABCD        100     490.10'"
      ]
     },
     "execution_count": 187,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b'%10s %10d %10.2f' % (b'ABCD', 100, 490.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 188,
   "metadata": {},
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "'bytes' object has no attribute 'format'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-188-de541e2ee6c2>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;34mb'{} {} {}'\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34mb'ABCD'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m100\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m490.1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mAttributeError\u001b[0m: 'bytes' object has no attribute 'format'"
     ]
    }
   ],
   "source": [
    "b'{} {} {}'.format(b'ABCD', 100, 490.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果想格式化字节字符串，得先使用标准的文本字符串，然后将其编码为字节字符串。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 189,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "b'ABCD              100     490.10'"
      ]
     },
     "execution_count": 189,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'{:10s} {:10d} {:10.2f}'.format('ABCD', 100, 490.1).encode('ascii')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后需要注意的是，使用字节字符串可能会改变一些操作的语义，特别是那些跟文件系统有关的操作。比如，如果你使用一个编码为字节的文件名，而不是一个普通的文本字符串，会禁用文件名的编码/解码。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 190,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['.ipynb_checkpoints',\n",
       " 'AIScanner API接口使用文档.ipynb',\n",
       " 'example.txt',\n",
       " 'filename',\n",
       " 'jalapeño.txt',\n",
       " 'MovieLens 1M数据集.ipynb',\n",
       " 'movies.dat',\n",
       " 'Pyhton数据结构和算法.ipynb',\n",
       " 'Python字符串和文本.ipynb',\n",
       " 'ratings.dat',\n",
       " 'users.dat',\n",
       " '来自bit.ly的1.usa.gov数据.ipynb']"
      ]
     },
     "execution_count": 190,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open('jalape\\xf1o.txt', 'w') as f:\n",
    "    f.write('spicy')\n",
    "    \n",
    "import os\n",
    "os.listdir('.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 191,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[b'.ipynb_checkpoints',\n",
       " b'AIScanner API\\xe6\\x8e\\xa5\\xe5\\x8f\\xa3\\xe4\\xbd\\xbf\\xe7\\x94\\xa8\\xe6\\x96\\x87\\xe6\\xa1\\xa3.ipynb',\n",
       " b'example.txt',\n",
       " b'filename',\n",
       " b'jalape\\xc3\\xb1o.txt',\n",
       " b'MovieLens 1M\\xe6\\x95\\xb0\\xe6\\x8d\\xae\\xe9\\x9b\\x86.ipynb',\n",
       " b'movies.dat',\n",
       " b'Pyhton\\xe6\\x95\\xb0\\xe6\\x8d\\xae\\xe7\\xbb\\x93\\xe6\\x9e\\x84\\xe5\\x92\\x8c\\xe7\\xae\\x97\\xe6\\xb3\\x95.ipynb',\n",
       " b'Python\\xe5\\xad\\x97\\xe7\\xac\\xa6\\xe4\\xb8\\xb2\\xe5\\x92\\x8c\\xe6\\x96\\x87\\xe6\\x9c\\xac.ipynb',\n",
       " b'ratings.dat',\n",
       " b'users.dat',\n",
       " b'\\xe6\\x9d\\xa5\\xe8\\x87\\xaabit.ly\\xe7\\x9a\\x841.usa.gov\\xe6\\x95\\xb0\\xe6\\x8d\\xae.ipynb']"
      ]
     },
     "execution_count": 191,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "os.listdir(b'.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一些程序员为了提升程序执行的速度会倾向于使用字节字符串而不是文本字符串。尽管操作字节字符串确实会比文本更加高效（因为处理文本固有的Unicode相关开销）。这样做通常会导致非常杂乱的代码。会经常发现字节字符串并不能和Python的其他部分工作的很好，并且还得手动处理所有的编码/解码操作。如果在处理文本，最好直接在程序中使用普通的文本字符串而不是字节字符串。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
